An extensible tool which can be used to extract any set of spreadsheets (typically versioned structured storage) to a table
- Currently only works on Windows OS
The spreadsheet extractor has been built to extract any data from any spreadsheet given user defined conditions. The initial intention of the tool was to aid in the extraction of data from a set of spreadsheets which are version controlled structured data cards, however the tool has been re-created into a general tool with a codebase fit for any data extraction need across a large set of spreadsheets. The tool processes each sheet/tab of every workbook given in Paths
tab, categorises it using the rules specified in the Categories
tab and then runs "Extraction Rules" based on the Rules
tab. The syntax of the rules is a superset of stdLambda
syntax, with additional injected functions giving it a formula-like feel. The stdLambda
syntax is loosely based on VBA's syntax. Please see Section 2 for instructions on how to engineer this this spreadsheet to extract data from your spreadsheets.
The initial point of call is to set up the categories which are required for the extraction of your data. Categories are used by the tool to identify specific worksheets of workbooks which require data extraction. Categories have been added to allow you to extract data differently from multiple spreadsheets dependant on their category. For example, categories can be set up to export data from different versions of multiple spreadsheets/cards which otherwise have different layouts and table structures. The rules table can be found on the "Categorisation Rules" spreadsheet.
A typical Categorisation Rules table can be found below.
The headers aren't required in order to make the tool work. It is important that you use the name of your category in the first column (in this case the different versions of the template), and the conditions for your category to the right of this list. Each condition column is AND
-ed together to define the category. Note you may also use a single column with AND()
and OR()
formula. In fact you can use all formula functions and operators built in to Excel to determine these conditions.
In the above case if I4
is "Version: 1.0.0" and J8
is not "Product Photo" then the worksheet will be categorised as "1.0.0". Alternatively if I4
is "Version: 1.1.0" and J8
is "Product Photo" then the worksheet will be categorised as "1.1.0". If the worksheet matches neither "1.0.0" or "1.1.0" then this worksheet will be skipped, i.e. no data extraction will take place.
Note:
The same category name can be used for multiple different condition sets.
Now that we are categorising each worksheet we can begin extracting data from each category. We do this by defining lines in the Rules table. This table is found on the "Rules" tab. As with the categorisation rules any formula functions can be used within the table. The evaluation engine is described in section 4 of this documentation. The important thing to note is that you can export data from the sheet with simple cell references surrounded by square brackets, i.e. [E11]
. You can also access the targetSheet directly, which gives you access to the worksheet object. You can also access range objects themselves with r[A1]
for instance.
A typical set of Extraction Rules can be found below:
The column names provided in this table directly correspond to the column names of the table which will appear in the output. Cell references and formula can be used to specify which cells the algorithm should export data from given the sheet for the categories listed on the left. It is important that the Category column remains as the 1st column in this table. All successive columns are user defined.
Note:
Unlike the categorisation rules, there is little reason to extract multiple categories of the same name at once. For this reason it is suggested that all category names in this table are unique, in order to help process data outputs.
After categorisation and extraction rules have been implemented, running extractions from cards is easy!
- Navigate to the Paths sheet
- Paste the file paths of all Excel spreadsheets that you want to extract data from.
- Click the run button
The tool will now execute and extract data from all workbook paths where Extracted <> "Yes"
Note:
You can quickly produce a file list from a folder in windows explorer by selecting all files you want to analyse, holding down the shift key down, right clicking on a file and selecting the "copy as path" option. This will copy the file paths of all selected paths to the clipboard.
The syntax of the formula engine is an important part to understand about the way this tool works. The formula engine is an extended version of stdLambda
who's official documentation can be found here: https://github.com/sancarn/stdVBA/blob/master/docs/stdLambda.md
The sheet which the rule is currently being ran on can be accessed via the variable named targetSheet
A value can be accessed from the sheet using the [...]
syntax: [A11]
will extract the data held in A11
(VBA equivalent of targetSheet.range("A11").value
).
A range object can be accessed from the sheet using the r[...]
syntax: r[A11]
will return a range object for A11
(VBA equivalent of targetSheet.range("A11")
).
After a range object has been obtained it can be acted upon, for instance the color of the cell can be obtained via the interior.color property: r[A11].interior.color
Additionally properties of the sheet object can be accessed too. For instance the name of the sheet can be accessed targetSheet.name.
This also allows finding specific text on a sheet: targetSheet.Find("SomeText")
; or within a range: r[A1:F22].Find("SomeText")
Information from the workbook can be found with targetSheet.parent
e.g. The workbook's name: targetSheet.parent.name
You can also access information about the current spreadsheet extractor using ThisWorkbook
e.g. The workbook's name: ThisWorkbook.name
Numerous functions are injected into the runtime of the formula. For instance all functions and properties within Application.WorksheetFunction
are accessible directly documented here.
This allows you to use functions like sum() to sum over a range: sum(r[A11:A21])
Additionally functions defined in xlLambdaEx
are included in the runtime. This allows further extensibility in the system than what formulae alone can provide. A list of the standard functions is included below.
In some cases more complicated expressions are required. The expression language allows for multiple lines of code within a single expression, either seperated by whitespace or by a :
character. This means scripts which obtain information can be used.
let x = getIntersectingShapes(r[A11:F24])
if x.count = 0 then
"No shapes"
else
x.count + " shapes"
end
stdRegex
has been injected into the xlLambdaEx
runtime, which can be used to extract information from strings.
Custom Functions included
getGUID() as string
'Generates a new GUID
'@returns {string} a new random GUID
getIntersectingShapes(rng As Range, Optional exitOnFind As Boolean = False) as Collection<Shape>
'Obtains a collection of shapes which intersect with a range
'@param {Range} The range to search for shapes within
'@param {Optional Boolean=False} Exit on the first shape found. Set this to true if you're only interested in finding 1 shape.
'@returns {Collection<Shape>} A collection of all intersecting shapes
exportShapeToFile(shp As Shape, sFileDir As String, sFileName As String) as String
'Exports a single shape to a PNG file.
'@param {Shape} Shape to export to file as PNG
'@param {String} Directory to export file to
'@param {String} New file name
'@returns {String} The full file path of the newly created PNG
exportShapeCollectionToFiles(shapes As Collection, sFileDir As String, sBaseName As String) As String
'Exports a collection of shapes to a PNG files.
'@param {Collection} Collection of shape objects to export to folder of PNG files
'@param {String} Base directory to create export folder within
'@param {String} Base name of new files
'@returns {String} The path to the folder containing all exported images
'@example ShapeCollectionToFiles(getIntersectingShapes(r[A1:M24]), "C:\Temp", "Image")
OffsetMerged(rng As Range, Optional rows As Long = 0, Optional cols As Long = 0) As Range
'Like `WorksheetFunction.Offset` but considers merged cells, which can be very useful when working with structured data formats.
'@param {Range} The range to perform the offset on
'@param {Long=0} Number of merged rows to offset by
'@param {Long=0} Number of merged columns to offset by
'@returns {Range} The resulting range after offsetting by merged rows and columns.
In order to extend the functionality with custom functions, add custom functions to xrExtractor.cls
. These will be imported dynamically into the formula engine.
At a high level:
xrExtractor
module is the main entry point for this toolxrExtractor
receives a stdEnumerator of file path entries from Paths table.xrExtractor
also creates a stdArray which all results are pushed toward.- With each path,
xrExtractor
opens the workbook and for each sheet in the workbook it runs:xrCategories
is run defining the category of the workbook from rules established in Categories table- If a category is found then
xrRules
are ran to extract an object containing results information.
- Both
xrCategories
andxrRules
usexrLambdaEx
as their calculation engine. xrLambdaEx
:- Is a wrapper around
stdLambda
- Extends
stdLambda
with functions from within WorksheetFunctions- Each function is wrapped with stdCallback
- Extends
stdLambda
with functions defined withinxrLambdaEx
(for extensibility)- Similarly, each function is wrapped with stdCallback
- Uses regex to rewrite the syntax
r[A1]
and[A1]
astargetSheet.Range("A1")
andtargetSheet.Range("A1").value
respectively.
- Is a wrapper around
- Basic extractor
- Allow extraction of images
- Allow extraction to multiple tables (i.e. database)
- Consider different category types, like a cell structure category rather than a sheet category.
TBC